Diacritization for Real-World Arabic Texts
Abstract
Diacritizing written Arabic text is important for many NLP tasks. In this work, we investigate the quality of a diacritization approach that has a high success rate on treebank data but more limited success on real-world data. One problem we encountered is the non-standard use of the hamza diacritic, which decreases diacritization accuracy. When an automatic hamza restoration module precedes diacritization, the word error rate improves from 9.20% to 7.38% on treebank data and from 7.96% to 5.93% on selected real-world texts. This shows that hamza restoration is a necessary step for improving diacritization quality on real-world Arabic texts.
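To put the reported figures in perspective, here is a minimal sketch of how a word-level error rate and the relative improvement could be computed. The abstract does not give the paper's exact WER definition, so the function below assumes the common one for diacritization (a word counts as an error if its predicted form differs from the reference in any way). The function names are hypothetical, not taken from the paper.

```python
def word_error_rate(reference, predicted):
    """Percentage of words whose predicted form differs from the reference.

    Assumption: both sequences are aligned word by word, which holds for
    diacritization because it adds marks without changing the word count.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same number of words")
    if not reference:
        return 0.0
    errors = sum(1 for ref, hyp in zip(reference, predicted) if ref != hyp)
    return 100.0 * errors / len(reference)


def relative_reduction(before, after):
    """Relative error reduction, in percent, going from `before` to `after`."""
    return 100.0 * (before - after) / before


# WER figures quoted in the abstract (in percent).
treebank = relative_reduction(9.20, 7.38)
real_world = relative_reduction(7.96, 5.93)
print(f"treebank: {treebank:.1f}% relative reduction")
print(f"real-world: {real_world:.1f}% relative reduction")
```

With the abstract's numbers, the absolute gains of 1.82 and 2.03 percentage points correspond to relative reductions of about 19.8% on treebank data and 25.5% on the selected real-world texts.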
Similar resources
Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems
Diacritics are often omitted in Arabic script. This omission is a handicap for new learners reading Arabic, for text-to-speech conversion systems, and for the reading and semantic analysis of Arabic texts. Automatic diacritization systems are the best solution to this issue, but such automation needs resources, such as diacritized texts, to train and evaluate these systems. In this paper, we describe ...
Smoothing methods for a morpho-statistical approach of automatic diacritization Arabic texts (Méthodes de lissage d'une approche morpho-statistique pour la voyellation automatique des textes arabes) [in French]
We present in this work a new three-stage approach to the automatic diacritization of Arabic texts. In the first phase, we integrated a lexical database containing the most frequent Arabic words with morphological analysis by Alkhalil Morpho Sys, which provides the possible diacritizations for each word. The objective of the second module is to eliminate the ambiguity using a statisti...
SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts
This paper sheds light on SHAKKIL, a system for diacritizing Arabic texts automatically. In this system, the diacritization problem is handled at two levels: morphological and syntactic processing. The adopted morphological disambiguation algorithm depends on four layers: a uni-morphological form layer, a rule-based morphological disambiguation layer, statistical-b...
Diacritization: A Challenge to Arabic Treebank Annotation and Parsing
Arabic diacritization (referred to sometimes as vocalization or vowelling), defined as the full or partial representation of short vowels, shadda (consonantal length or gemination), tanween (nunation or definiteness), and hamza (the glottal stop and its support letters), is still largely understudied in the current NLP literature. In this paper, the lack of diacritics in standard Arabic texts ...
Arabic Diacritization in the Context of Statistical Machine Translation
Diacritics in Arabic are optional orthographic symbols typically representing short vowels. Most Arabic text is underspecified for diacritics. However, we do observe partial diacritization depending on genre and domain. In this paper, we investigate the impact of Arabic diacritization on statistical machine translation (SMT). We define several diacritization schemes ranging from full to partial...